Explore or Exploit? Effective Strategies for Disambiguating Large Databases

Authors

  • Reynold Cheng
  • Eric Lo
  • Xuan S. Yang
  • Ming-Hay Luk
  • Xiang Li
  • Xike Xie
Abstract

Data ambiguity is inherent in applications such as data integration, location-based services, and sensor monitoring. In many situations, it is possible to “clean”, or remove, ambiguities from these databases. For example, the GPS location of a user is inexact due to measurement errors, but context information (e.g., what the user is doing) can be used to reduce the imprecision of the location value. To obtain a higher-quality database, we study how to disambiguate a database by appropriately selecting candidates to clean. This problem is challenging because cleaning involves a cost, is limited by a budget, may fail, and may not remove all ambiguities. Moreover, the statistical information about how likely it is that database objects can be cleaned may not be precisely known. We tackle these challenges by proposing two types of algorithms. The first type uses greedy heuristics to make sensible decisions; however, these algorithms do not make use of cleaning information and require user-supplied parameters to achieve high cleaning effectiveness. The second type is the Explore-Exploit (or EE) algorithm, which gathers valuable information during the cleaning process to determine how the remaining cleaning budget should be invested. We also study how to fine-tune the parameters of EE in order to achieve optimal cleaning effectiveness. Experimental evaluations on real and synthetic datasets validate the effectiveness and efficiency of our approaches.
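
The abstract does not spell out how EE works, so the following is only a generic explore-then-exploit sketch of budgeted cleaning, not the authors' algorithm. It assumes objects fall into groups that share an unknown cleaning success rate; the names `costs`, `try_clean`, and `explore_budget` are hypothetical and do not come from the paper. A small part of the budget is spent sampling every group, and the rest goes to the group with the best observed success rate per unit cost.

```python
from typing import Callable, Dict


def explore_then_exploit(
    costs: Dict[str, int],
    try_clean: Callable[[str], bool],
    budget: int,
    explore_budget: int,
) -> Dict[str, int]:
    """Return the number of successful cleanings per group.

    Illustrative assumptions (not from the paper):
    - costs[g] is the positive cost of one cleaning attempt in group g;
    - try_clean(g) performs one attempt and reports whether it succeeded.
    """
    attempts = {g: 0 for g in costs}
    successes = {g: 0 for g in costs}
    spent = 0

    def attempt(group: str) -> None:
        nonlocal spent
        spent += costs[group]
        attempts[group] += 1
        if try_clean(group):
            successes[group] += 1

    # Explore: round-robin over groups until the exploration budget is used up.
    limit = min(explore_budget, budget)
    progressed = True
    while progressed:
        progressed = False
        for group in costs:
            if spent + costs[group] <= limit:
                attempt(group)
                progressed = True

    # Exploit: commit the remaining budget to the group with the best
    # observed success rate per unit cost.
    def score(group: str) -> float:
        if attempts[group] == 0:
            return 0.0
        return successes[group] / attempts[group] / costs[group]

    best = max(costs, key=score)
    while spent + costs[best] <= budget:
        attempt(best)
    return successes


# Toy run: group "b" always cleans successfully, group "a" never does.
result = explore_then_exploit(
    costs={"a": 1, "b": 1},
    try_clean=lambda group: group == "b",
    budget=10,
    explore_budget=4,
)
```

In this toy run the exploration phase spends four units evenly across both groups, after which the remaining six units go to group "b". A larger `explore_budget` gives better estimates but leaves less to invest, which is the trade-off the abstract alludes to when it mentions tuning the parameters of EE.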

Similar resources

6. Conclusion 5. Experimental Observations 4. Related Work

Much further research is needed on explore/exploit strategies. The issue is crucial for true autonomy: consider the Mars robot or any system in a substantially unknown environment that cannot be teleoperated. This paper suggests that strategies based on error or its rate of change can permit systems to control the tradeoff autonomously. Major desiderata now are further experiments, to build up ...


Focusing Strategies for Multiple Fault Diagnosis

Diagnosing multiple faults for a complex system is often very difficult. It requires not only a model which adequately represents the diagnostic aspect of a complex system, but also an efficient diagnostic algorithm that can generate effective test and repair recommendations. One way of developing such an efficient and effective diagnostic algorithm is to focus the computational resource on dis...


What is the nature of decision noise in random exploration?

The explore-exploit tradeoff is a fundamental behavioral dilemma faced by all adaptive organisms. Should we explore new options in the hopes of finding a better meal, a better house or a better mate, or should we exploit the options we currently believe to be best? Striking the right balance between exploration and exploitation is a hard computational problem and there is significant interest in ...


On the value of soil moisture measurements in vadose zone hydrology: A review

We explore and review the value of soil moisture measurements in vadose zone hydrology with a focus on the field and catchment scales. This review is motivated by the increasing ability to measure soil moisture with unprecedented spatial and temporal resolution across scales. We highlight and review the state of the art in using soil moisture measurements for (1) estimation of soil hydrauli...


Exploiting Lexical Resources for Disambiguating CJK and Arabic Orthographic Variants

The orthographical complexities of Chinese, Japanese, Korean (CJK) and Arabic pose a special challenge to developers of NLP applications. These difficulties are exacerbated by the lack of a standardized orthography in these languages, especially the highly irregular Japanese orthography and the ambiguities of the Arabic script. This paper focuses on CJK and Arabic orthographic variation and pro...


Journal:
  • PVLDB

Volume 3  Issue 

Pages  -

Publication date: 2010